Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus
Abstract
This paper presents PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and reports experiments on cross-dialect Arabic machine translation. PADIC covers dialects from both the Maghreb and the Middle East, each aligned with Modern Standard Arabic (MSA). Five dialects are included: three from the Maghreb (two from Algeria and one from Tunisia) and two from the Middle East (Syria and Palestine). We built PADIC from scratch because of the lack of dialect resources: Arabic dialects are used throughout the Arab world in daily conversation, but they are generally not written. To the best of our knowledge, PADIC is currently the largest corpus available to the community working on dialects, especially those of the Maghreb. It contains 6400 sentences for each of the five dialects and for MSA. We conducted machine translation experiments between all the language pairs. For translation into MSA, we interpolated the corresponding language model (LM) with an LM trained on a large Arabic corpus. We also studied the impact of language model smoothing techniques on the translation results because this corpus, even though it is the largest one, is still very small compared with those used for translating natural languages.
∗Ecole Normale Supérieure Bouzareah.
†Multimedia, InfoRmation systems and Advanced Computing Laboratory.
‡Centre de Recherche Scientifique et Technique pour le Développement de la Langue Arabe.
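The language model interpolation mentioned in the abstract can be illustrated with a minimal sketch. It assumes simple linear interpolation of two add-alpha smoothed unigram models, with the weight picked by perplexity on held-out text. The abstract does not specify the interpolation scheme, the model order, or the toolkit, so the function names, the toy tokens, and the grid search below are all hypothetical and are not taken from PADIC.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def interpolate(p_small, p_large, lam):
    """Linear interpolation: lam * p_small(w) + (1 - lam) * p_large(w)."""
    return {w: lam * p_small[w] + (1.0 - lam) * p_large[w] for w in p_small}

def perplexity(lm, tokens):
    """Perplexity of a unigram model on a token sequence."""
    log_prob = sum(math.log(lm[w]) for w in tokens)
    return math.exp(-log_prob / len(tokens))

# Toy token lists (hypothetical stand-ins for the small in-domain
# corpus, the large external corpus, and a held-out set).
small = "a b a c".split()
large = "a b b c c c d d".split()
held_out = "a c d".split()
vocab = set(small) | set(large) | set(held_out)

p_small = unigram_lm(small, vocab)
p_large = unigram_lm(large, vocab)

# Pick the interpolation weight on a coarse grid by held-out perplexity.
grid = [i / 10 for i in range(11)]
best_lam = min(
    grid,
    key=lambda lam: perplexity(interpolate(p_small, p_large, lam), held_out),
)
mixed = interpolate(p_small, p_large, best_lam)
```

Because both component models sum to one over the same vocabulary, any weight between 0 and 1 yields a valid distribution. The grid includes the endpoints 0 and 1, so the selected mixture is never worse on the held-out set than either model alone.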
Similar resources
Domain and Dialect Adaptation for Machine Translation into Egyptian Arabic
In this paper, we present a statistical machine translation system for English to Dialectal Arabic (DA), using Modern Standard Arabic (MSA) as a pivot. We create a core system to translate from English to MSA using a large bilingual parallel corpus. Then, we design two separate pathways for translation from MSA into DA: a two-step domain and dialect adaptation system and a one-step simultaneous...
Exploiting Out-of-Domain Data Sources for Dialectal Arabic Statistical Machine Translation
Statistical machine translation for dialectal Arabic is characterized by a lack of data since data acquisition involves the transcription and translation of spoken language. In this study we develop techniques for extracting parallel data for one particular dialect of Arabic (Iraqi Arabic) from out-of-domain corpora in different dialects of Arabic or in Modern Standard Arabic. We compare two dif...
A study of a non-resourced language: an Algerian dialect
The objective of this paper is to present an under-resourced language related to Arabic. In fact, in several countries through the Arabic world, no one speaks the modern standard Arabic language. People speak something which is inspired from Arabic but could be very different from the modern standard Arabic. This one is reserved for the official broadcast news, official discourses and so on. Th...
Arabic Dialect Identification Using a Parallel Multidialectal Corpus
We present a study on sentence-level Arabic Dialect Identification using the newly developed Multidialectal Parallel Corpus of Arabic (MPCA) – the first experiments on such data. Using a set of surface features based on characters and words, we conduct three experiments with a linear Support Vector Machine classifier and a meta-classifier using stacked generalization – a method not previously a...
Low Resourced Machine Translation via Morpho-syntactic Modeling: The Case of Dialectal Arabic
We present the second ever evaluated Arabic dialect-to-dialect machine translation effort, and the first to leverage external resources beyond a small parallel corpus. The subject has not previously received serious attention due to lack of naturally occurring parallel data; yet its importance is evidenced by dialectal Arabic’s wide usage and breadth of inter-dialect variation, comparable to th...